1. STATEMENT OF RESEARCH PROBLEM


The systolic algorithm approach is currently the most effective design procedure for minimizing computing time and the number of processors. Systolic arrays (SAs) are examples of VLSI special-purpose processor networks that realize computationally expensive algorithms in hardware and can achieve massive concurrency. Such arrays are widely used for the real-time implementation of digital signal processing algorithms, because the properties of these algorithms allow them to be mapped effectively onto high-speed SA architectures. The main problem in a systolic approach is that extra delays are needed to assure proper timing and synchronization; collectively, these delays slow down computation and thereby reduce the throughput rate. For a large-scale array, this synchronization can become particularly tedious. Moreover, (i) algorithms for asynchronous SAs have not been fully utilized for data reduction, and (ii) the conversion of sequential input signals into input blocks that alter the organization of the systolic mesh has not been performed.

The research sponsored by this grant focused on the following important technical issues that result in the development of efficient SAs.

•  Data  reduction  techniques  via  the  utilization  of  algorithm  properties. 

• Conversion of sequential input signals into input blocks by means of a spiral systolic mesh that is suitable for parallel processing and that is flexible for enabling various array dimensions.

•  Making  data  streams  independent  of  computations  executed  in  each  processor, 
thus  reducing  waiting  time. 


The main objective of this research grant is to develop a theoretical and technological basis for designing an asynchronous SA, a hybrid of the SA and data-flow approaches, that is capable of converting input signals into input blocks to produce a faster and more flexible system.¹ The key to this design is a spiral SA structure, self-timed processors, and communication protocols that control the data streams so that each computation can start as soon as all its data are available. Moreover, the communication protocols resolve the data-flow conflicts created by merging the spiral and asynchronous SA architectures. The SA processes the input signal efficiently and eliminates the complex shift-register organization of traditional filter realizations. Incorporated in this design are maximum parallelism and pipelinability, and a trade-off among computations, communications, and memory. Furthermore, the systolic array uses simple local interconnections without undesirable properties, such as preloading of input data or global broadcasting.


¹ For most digital signal processing algorithms.


2.  SUMMARY  OF  RESULTS 


The research performed during this grant focused on the improvement of present methods of designing systolic arrays [1-7] for the following digital signal processing algorithms [8-15]:

•  discrete  Fourier  transform  (DFT) 

•  inverse  discrete  Fourier  transform  (IDFT) 

•  convolution 

•  correlation 

•  linear  phase  filter  (LPF) 

•  discrete  Hadamard  transform  (DHT) 

•  one-step  adaptive  lattice  predictors 

•  arbitrarily  large  LMS  adaptive  filters 

This required a comprehensive research effort that addresses theoretical, system, and performance aspects.

Theoretical Aspect: Describe useful mathematical representations for digital signal processing (DSP) algorithms.

System Aspect: Develop a technique for tailoring the algorithms into forms suitable for mapping onto SA architectures, and develop array systems for implementing the architectures efficiently.

Performance Aspect: Evaluate the performance of the new algorithms developed especially for the asynchronous array.


2.1  Theoretical  Aspect 

The mathematical representations for DSP algorithms include important properties such as local and regular data broadcasting. On the basis of these representations, we have arrived at a design procedure for creating a systolic array and a systolic algorithm that realize DSP algorithms efficiently. For example, lattice filters offer the following features:

1.  They  are  very  efficient  structures,  because  of  their  simultaneous  production  of  the 
forward  and  backward  prediction  errors. 

2.  The  various  stages  of  the  lattice  predictors  are  decoupled  from  each  other. 
Moreover,  the  backward  prediction  errors  generated  by  the  stages  of  a  lattice 
predictor  are  orthogonal  to  each  other  for  wide-sense  stationary  input  data. 

3.  They  are  modular  in  structure;  an  order  increase  is  induced  by  adding  one  or 
more  stages  without  affecting  previous  computations. 

4. The structure of each of the stages is identical, making lattice filters viable candidates for VLSI implementation.

Figure 1 illustrates the diagram of a one-step predictor of order 3.


Fig.  1.  One-step  adaptive  lattice  predictor.  L  =  3. 

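The forward and backward signals are labeled $s_{l,k}$ and $s'_{l,k}$, respectively. The relations describing a one-step adaptive lattice predictor are

$$
\begin{aligned}
s_{0,k} &= s'_{0,k} = x_k \\
s_{l+1,k} &= s_{l,k} + K_{l,k}\, s'_{l,k-1}, \qquad 0 \le l \le L-1 \\
s'_{l+1,k} &= K_{l,k}\, s_{l,k} + s'_{l,k-1}, \qquad 0 \le l \le L-1 \\
\varepsilon^{f}_{k} &= s_{L,k} \\
\varepsilon^{b}_{k} &= s'_{L,k}
\end{aligned}
\tag{1}
$$

where $\varepsilon^{f}_{k}$ and $\varepsilon^{b}_{k}$ are the respective forward and backward prediction errors. The LMS algorithm for this lattice is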

$$ K_{l,k+1} = K_{l,k} - 2\mu_l\, s_{l+1,k}\, s'_{l,k-1}, \qquad 0 \le l \le L-1 \tag{2} $$

Here, $\mu_l$ is the so-called convergence parameter, which is time invariant. This parameter is set to different values from stage to stage [13]. The data reduction technique can be demonstrated via the theoretical aspect of the LPF digital filters. The one-dimensional FIR filter is mathematically described as [11]

$$ y(k) = \sum_{n=1}^{N} \omega(n)\, x(k-n+1), \qquad k = 1, 2, \ldots, N+L-1 \tag{3} $$

where $y(k)$ is the filter output, $x(m)$ is the filter's input signal, $m = 1, 2, \ldots, L$, and $\omega(n)$ is the filter's impulse response.

For  LPF  digital  filters,  the  following  four  cases  are  considered. 

Cases 1 and 2: Even filter order (+ for even and − for odd symmetry):

$$ y(k) = \sum_{n=1}^{N/2} \omega(n)\,\bigl[\,x(k-n+1) \pm x(k-N+n)\,\bigr] \tag{4} $$

$$ \omega(n) = \pm\,\omega(N+1-n) $$
Cases 3 and 4: Odd filter order (+ for even and − for odd symmetry):

$$ y(k) = \sum_{n=1}^{(N-1)/2} \omega(n)\,\bigl[\,x(k-n+1) \pm x(k-N+n)\,\bigr] + \eta(k) \tag{5} $$

$$ \omega(n) = \pm\,\omega(N+1-n) $$

$$ \eta(k) = \begin{cases} \omega\!\left(\tfrac{N+1}{2}\right) x\!\left(k - \tfrac{N-1}{2}\right) & \text{for even symmetry} \\ 0 & \text{for odd symmetry} \end{cases} $$


Because of the similarities between the first two and last two cases, case 1 and case 2 are combined into one class, and case 3 and case 4 are combined into another class, with the respective reduced filter orders of $N/2$, $N/2$, $(N+1)/2$, and $(N-1)/2$ for SA implementation [16].
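
The data reduction can be checked numerically. The following is a minimal Python sketch, not the report's implementation: `direct_fir` evaluates equation (3) with N multiplications per output, and `folded_fir` evaluates equations (4) and (5) with roughly half as many. The function names and the convention that x(m) = 0 outside 1..L are assumptions made for illustration.

```python
def x_at(x, m):
    """Return x(m) for 1-based m, treating samples outside 1..L as zero."""
    return x[m - 1] if 1 <= m <= len(x) else 0.0


def direct_fir(w, x):
    """Equation (3): y(k) = sum_{n=1..N} w(n) x(k-n+1), for k = 1..N+L-1."""
    N, L = len(w), len(x)
    return [sum(w[n - 1] * x_at(x, k - n + 1) for n in range(1, N + 1))
            for k in range(1, N + L)]


def folded_fir(w, x, sign):
    """Equations (4)/(5): sign = +1 for even symmetry, -1 for odd symmetry."""
    N, L = len(w), len(x)
    half = N // 2                      # N/2 for even N, (N-1)/2 for odd N
    y = []
    for k in range(1, N + L):
        acc = sum(w[n - 1] * (x_at(x, k - n + 1) + sign * x_at(x, k - N + n))
                  for n in range(1, half + 1))
        if N % 2 == 1:
            # Middle tap of an odd-order filter; it is zero under odd symmetry,
            # so this term only contributes for even symmetry.
            mid = (N + 1) // 2
            acc += w[mid - 1] * x_at(x, k - mid + 1)
        y.append(acc)
    return y


w_even = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]      # N = 6, even symmetry
samples = [1.0, -2.0, 0.5, 4.0]
assert direct_fir(w_even, samples) == folded_fir(w_even, samples, +1)
```

Each folded output needs only N/2 (or (N+1)/2) multiplications, which is the reduced filter order used to size the array.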

A summary of the theoretical aspect for the remaining DSP algorithms is listed in Appendix A.²


2.2  System  Aspect 

A fundamental problem in the design of SA structures is to obtain localized algorithms, namely, localization of data dependencies. To achieve maximum parallelism in an algorithm, data dependencies during the computations must be determined. The following definitions are useful for the system aspect of SA design:

Definition 1: A dependency graph shows the dependency of the computations that occur in an algorithm. An algorithm is computable if, and only if, its dependency graph contains no loops or cycles.

Definition 2: A localized dependency graph has only local dependencies, i.e., the length of each dependency arc is independent of the problem size.

Definition  3:  A  computational  graph  is  a  localized  dependency  graph  with  each  node 
in  the  graph  labeled  by  the  indices  of  the  terms  it  computes. 
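
Definitions 1 and 2 can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the report's tooling: the graph is a hypothetical dictionary mapping each node to the nodes it depends on, nodes are index tuples, and arc length is measured as the largest index difference. The example graph `fir_graph` encodes the recursion $y(k)_i = y(k)_{i-1} + \omega(i)\,x(k-i+1)$, with the sample $x(k-i+1)$ assumed to arrive from node $(k-1, i-1)$.

```python
def is_computable(dependencies):
    """Definition 1: True if the dependency graph contains no cycles."""
    WHITE, GREY, BLACK = 0, 1, 2
    nodes = set(dependencies)
    for targets in dependencies.values():
        nodes.update(targets)
    colour = {node: WHITE for node in nodes}

    def visit(node):
        colour[node] = GREY
        for dep in dependencies.get(node, ()):
            if colour[dep] == GREY:          # back edge, so a cycle exists
                return False
            if colour[dep] == WHITE and not visit(dep):
                return False
        colour[node] = BLACK
        return True

    return all(colour[n] != WHITE or visit(n) for n in nodes)


def max_arc_length(dependencies):
    """Largest index distance between a node and any of its dependencies."""
    return max((max(abs(a - b) for a, b in zip(node, dep))
                for node, deps in dependencies.items() for dep in deps),
               default=0)


def fir_graph(K, N):
    """Dependency graph of the FIR partial-sum recursion for K outputs, N taps."""
    deps = {}
    for k in range(1, K + 1):
        for i in range(1, N + 1):
            d = []
            if i > 1:
                d.append((k, i - 1))          # partial sum y(k)_{i-1}
                if k > 1:
                    d.append((k - 1, i - 1))  # sample x(k-i+1) passed diagonally
            deps[(k, i)] = d
    return deps


assert is_computable(fir_graph(5, 4))
# Arc length stays at 1 as the problem grows: the graph is localized.
assert max_arc_length(fir_graph(5, 4)) == max_arc_length(fir_graph(50, 8)) == 1
```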

For SA architecture, determining the computational graph for a given algorithm is a suitable starting point for non-structurally altered arrays. For example, spiral SAs have been structurally altered to conform to an architecture capable of processing blocks of input signals, rather than sequential input signals. The computational graph for the one-step adaptive lattice predictor is shown in Fig. 2. The large circles represent processing elements (PEs), and the first and second digits of the small circles represent one of nine operations, shown in Fig. 3, and the PE number, respectively.


² This list is presented to maintain continuity with the system and performance design aspects of the DSP algorithms.




Fig. 2. Computational graph for $\varepsilon^{f}_{k}$ and $\varepsilon^{b}_{k}$.

The computational scheme illustrates identical sets of operations in various stages. This enables us to determine the interconnection scheme, which reveals the PE organization and the PEs' inputs and outputs. The computational and interconnection schemes for the one-step adaptive lattice predictor are illustrated in Figs. 3 and 4, and the internal operations of a PE are shown in Fig. 5 [17]. The system aspect of the SA design for the DSP algorithms is listed in Appendix B [18-19].


Fig. 3. Computational scheme for $\varepsilon^{f}_{k}$ and $\varepsilon^{b}_{k}$.


Fig. 4. Interconnection scheme for $\varepsilon^{f}_{k}$ and $\varepsilon^{b}_{k}$.


Fig. 5. Proposed PE for $\varepsilon^{f}_{k}$ and $\varepsilon^{b}_{k}$.
2.2.1 Spiral Architecture


To illustrate this method, we describe herein the process involved in designing a spiral SA for an LPF digital filter. An input signal $x(m)$ is passed from the input $x_{in}$ of a PE at $(i, j)$ to its output $x_{out}$ (see equations 4 and 5). Using zero or one delay, this input is transferred to the input $x_{in}$ of the neighboring cell $(i+1, j')$. The index $j'$ can be identified as $j+1$ or $j-1$ using an analogy from optics. Input data $x(k)$ moves between the PEs in the direction $d_1 = 5$ or $d_2 = 3$ until it reaches the left or right boundary of the array, respectively (see Fig. 6(a) and (b)). Then the input is "reflected" back in the respective direction $d_1 = 3$ or $d_2 = 5$, with the intermediate direction $d_3 = 4$. The intercell propagation of the intermediate values $y(k)_i$ of the output data $y(k)$ can be described by

$$ y(k)_i = y(k)_{i-1} + \omega(i)\, x(k-i+1) \tag{6} $$

where $y(k)_0 = 0$ and $y(k)_N = y(k)$. A delay element is positioned immediately following a PE at $(i, j)$, provided that


0^  ^  ^  ^mod  M 
j  _  )  M 

~  Omod.  M 

+  1Wd 

where,  M  is  the  number  of  input  data  blocks. 


y0i>i 


,  fr-1)/Vf</<r/Vf  r  odd 

for  <  .  ’ 

n  =  r  M 

,  \{r  -  ^)  M  <  i  <  r  M  r  even 

for  s'  A4 

[I  =  r  M 


(a)  Directional  encoded  values  (b)  Data  propagation  between  array  PEs 

Fig.  6.  Propagation  of  data  between  array  PEs. 

The two systolic array architectures with spiral structure of intercell connections, proposed to compute equation (4) for class 1 and equation (5) for class 2 (for N = 12), are depicted in Fig. 7(a) and (b), respectively [16]. The secondary input signal $x(k-N+n)$ is shown integrated into the structure. In this structure, the PE performs the following operation, as illustrated in Fig. 8:

$$ X_{out} = X_{in} $$

$$ Y_{out} = Y_{in} + \omega_n\,\bigl[\,X_{in} \pm X'_{in}\,\bigr] $$

Here $x(k-N+n)$ is the past input data, symmetric to $x(k-n+1)$ and $N-2n+1$ sampling periods apart.
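
The PE operation can be sketched in Python as a pure function, under stated assumptions: `pe_step` is a hypothetical name, and `folded_output` simply cascades the cells for one output sample while ignoring the array timing, delays, and spiral routing described above.

```python
def pe_step(weight, x_in, x_fb, y_in, sign=+1):
    """One PE operation: pass x through and accumulate the folded product.

    x_fb plays the role of the secondary (feedback) input x(k - N + n).
    """
    x_out = x_in
    y_out = y_in + weight * (x_in + sign * x_fb)
    return x_out, y_out


def folded_output(half_weights, x, k, N, sign=+1):
    """Compute y(k) of equation (4) by chaining N/2 PE steps (timing ignored)."""
    def sample(m):                       # x(m), 1-based, zero outside 1..L
        return x[m - 1] if 1 <= m <= len(x) else 0.0

    y = 0.0
    for n, weight in enumerate(half_weights, start=1):
        _, y = pe_step(weight, sample(k - n + 1), sample(k - N + n), y, sign)
    return y


# N = 4 with weights (1, 2, 2, 1): only the first two weights are stored.
# y(4) = 1*(x(4) + x(1)) + 2*(x(3) + x(2)) = 5 + 10
assert folded_output([1.0, 2.0], [1.0, 2.0, 3.0, 4.0, 5.0], k=4, N=4) == 15.0
```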



(a)  Class  1  (b)  Class  2 

Fig.  7.  Spiral  systolic  array  architecture  for  LPF  filter  of  order  12. 
(Notations:  +  for  even  symmetry  and  -  for  odd  symmetry.) 
Fig. 8. Inputs and outputs of a PE.

Let $U$, the reduced filter order, and $M$, the number of input data blocks, designate the spiral SA's number of rows and columns, respectively. Denote $\mathrm{indx}_i(i, j, t)$ and $\mathrm{indx}_o(i, j, t)$ to be the respective indices of input and output signals of a PE, where $i$, $j$, and $t$ are the row, column, and block kernel cycle. Here $i = 1, \ldots, U$, $j = 1, \ldots, M$, and $t = 1, \ldots, T_{max}$ ($T_{max} = (K/M + 0.5)_{integer}$), where $K$ is the number of input signals. Note that the indices of input and output signals are related by the following:

$$ \mathrm{indx}_o(i, j, t) = \mathrm{indx}_i(i, j, t) - M \tag{9} $$


The following algorithm demonstrates the positioning of the input signals in the PEs of the spiral SA, at kernel cycle t.


Algorithm 1. { * Input: spiral SA's row and column size, and input data size
               Output: input signals' indices, indx_i(i, j, t). * }

Begin
  T_max := (K/M + 0.5)_integer;
  for t := 1 to T_max do
  begin
    l := { M/2, if M even;  (M - 1)/2, if M odd };
    i := 1;
    for j := 1 to M do                    { 1st row computations }
    begin
      indx_i[i, j, t] := { (t - 1) M + (j + 1)/2, if j odd;
                           t M - (j - 2)/2,       if j even }
    end;
    for i := 2 to N do                    { rows 2 to N computations }
    begin
      p := i mod 2;
      for j := 1 to M do
        indx_i[i, j, t] := indx_i[i - 1, j, t] - M - 1;
      for q := 1 to { [...], if i even;  [...] } do
      begin
        j1 := 2q + p - 2;  j2 := [...];
        if j2 <= M then
        begin
          mid := indx_i[i, j1, t];        { swap of indices }
          indx_i[i, j1, t] := indx_i[i, j2, t];
          indx_i[i, j2, t] := mid;
        end;                              { j2 <= M }
      end;
    end;                                  { i := 2, N }
  end;                                    { t = 1, T_max }
End.
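
The first-row rule of Algorithm 1 and the kernel-cycle count can be sketched in Python. This is a partial illustration only: it covers the first row of the array, and the function names are hypothetical. The remaining rows are omitted because they depend on the swap step of the algorithm.

```python
def kernel_cycles(K, M):
    """T_max = (K/M + 0.5) truncated to an integer, as in Algorithm 1."""
    return int(K / M + 0.5)


def first_row_indices(M, t):
    """Input-signal indices indx_i[1, j, t] for columns j = 1..M."""
    row = []
    for j in range(1, M + 1):
        if j % 2 == 1:
            row.append((t - 1) * M + (j + 1) // 2)   # odd columns count upward
        else:
            row.append(t * M - (j - 2) // 2)         # even columns count downward
    return row


assert kernel_cycles(20, 4) == 5
assert first_row_indices(4, 1) == [1, 4, 2, 3]
assert first_row_indices(4, 2) == [5, 8, 6, 7]
```

For M = 4 each kernel cycle therefore loads one block of four consecutive samples, interleaved from both ends of the block.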




The index of some feedback input, $l$, is related to the index of its destination PE, $k$, by (see eq. 4 and Fig. 7)

$$
\begin{aligned}
l &= k + \Delta k_n \\
\Delta k_n &= (k - N + n) - (k - n + 1) = -N + 2n - 1 \\
l &= k + 2n - N - 1
\end{aligned}
\tag{10}
$$

An example of the PE input/output data distribution in the spiral SA, with dimensions $N/2 = 6$ and $M = 4$ (at kernel cycle t = 9), is shown in Fig. 9:

Fig. 9. PE input/output data distribution for kernel cycle t = 9.


The connection matrices for the feedback input data of the class 1 and class 2 spiral SA (LPF of order 12) are shown here. (Note that $N/2 = 6$, $M = 4$ for class 1 and $(N-1)/2 = 5$, $M = 4$ for class 2.)


class 1:

$$
\begin{bmatrix}
\pm O(3,4) & \pm O(2,2) & \pm I(3,2) & \pm O(2,1) \\
\pm O(3,2) & \pm O(3,1) & \pm O(3,4) & \pm O(3,3) \\
\pm O(4,4) & \pm O(4,2) & \pm I(4,3) & \pm O(3,2) \\
\pm O(4,3) & \pm O(4,1) & \pm I(5,4) & \pm O(4,2) \\
\pm O(5,3) & \pm O(5,1) & \pm I(5,4) & \pm O(5,2) \\
\pm O(5,4) & \pm O(5,2) & \pm I(6,4) & \pm O(6,2)
\end{bmatrix}
\tag{11}
$$
class 2:

$$
\begin{bmatrix}
\pm I(3,2) & \pm O(2,4) & \pm O(2,1) & \pm O(2,2) \\
\pm O(3,1) & \pm O(3,3) & \pm I(3,2) & \pm O(3,4) \\
\pm I(4,3) & \pm O(4,4) & \pm O(3,2) & \pm O(3,1) \\
\pm O(4,1) & \pm O(4,2) & \pm O(4,3) & \pm O(4,4) \\
\pm I(5,4) & \pm O(5,3) & \pm I(5,2) & \pm O(5,1)
\end{bmatrix}
\tag{12}
$$


where $O(i,j)$ and $I(i,j)$ at the $(m, n)$ location of the connection matrix refer to the connection made from the respective output and input of the PE at location $(i,j)$ to the input of the PE that makes up the processor's feedback input data at the location $(m, n)$ of the spiral SA.


The spiral SA structures for computing eqs. A4 and A5 for the DFT and IDFT algorithms and for computing eqs. A2 and A3 for the convolution and correlation algorithms are illustrated in Appendix C [20]. Note that, contrary to the LPF spiral SA structure, neither structure requires a feedback input and, in addition, the DFT and IDFT spiral SA structures contain no delay elements.


2.2.2  Asynchronous  Architecture 

The performance of a systolic array can be improved further if an asynchronous approach is adopted. The idea is to design self-timed processors and communication protocols that control the data streams, so that each computation can start as soon as all its data are available. The proposed array is a hybrid of a systolic array and a data-flow machine.

In the asynchronous design, self-timing PEs and a communication protocol are employed instead of a global clock. The advantage is the following: the whole period of a clock unit for addition/subtraction, multiplication, addition, and data routing can be separated into several small steps, and some of these steps can be executed simultaneously. The concept of asynchronous computations can be specified as shown below, with steps 2 and 3 executed simultaneously.

Step 1

• send the acknowledge signal to, and receive data from, the previous processor

• send a request signal that simultaneously forwards data to the next processor

Step 2

• transfer data to the next processor

Step 3

• add/subtract, multiply, and accumulate the results




The basic features of the proposed array remain the same as in the systolic array, with the exception that the data routing and computing in each PE can be operated simultaneously. Moreover, a PE does not wait for data until the previous PE completes its computations. The following algorithm reflects this new feature for an LPF digital filter [16].


Algorithm 2: Linear Phase FIR Digital Filter (Asynchronous)

Begin
  While there are data entering PE_{i,j} do
  Begin
    Receive input data x_in, x'_in, y_in ;
    Send x_in to the next processing element ;
    & v_{i,j} ← x_in ± x'_in ;
    u_{i,j} ← ω_i v_{i,j} ;
    y_out ← u_{i,j} + y_in ;
  End While
End
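
The firing rule of Algorithm 2 can be sketched with Python's `asyncio`, under stated assumptions: queues stand in for the PE ports, `None` is a hypothetical end-of-stream marker, and only a single PE is modeled, so the alignment of data streams across the array is not represented. The PE fires once all three inputs are available and forwards `x` before doing its arithmetic.

```python
import asyncio


async def pe(weight, sign, x_in, xfb_in, y_in, x_out, y_out):
    """Self-timed PE: fires whenever all three input ports hold data."""
    while True:
        x, x_fb, y = await asyncio.gather(x_in.get(), xfb_in.get(), y_in.get())
        if x is None:                    # end-of-stream marker (assumption)
            await x_out.put(None)
            await y_out.put(None)
            return
        await x_out.put(x)               # routing does not wait for arithmetic
        v = x + sign * x_fb              # v <- x_in +/- x'_in
        u = weight * v                   # u <- w * v
        await y_out.put(u + y)           # y_out <- u + y_in


async def run_single_pe(weight, sign, triples):
    """Drive one PE with (x_in, x'_in, y_in) triples and collect its outputs."""
    x_in, xfb_in, y_in, x_out, y_out = (asyncio.Queue() for _ in range(5))
    task = asyncio.create_task(pe(weight, sign, x_in, xfb_in, y_in, x_out, y_out))
    for x, x_fb, y in triples:
        await x_in.put(x)
        await xfb_in.put(x_fb)
        await y_in.put(y)
    for queue in (x_in, xfb_in, y_in):
        await queue.put(None)
    await task
    results = []
    while True:
        x = await x_out.get()
        y = await y_out.get()
        if x is None:
            break
        results.append((x, y))
    return results


outputs = asyncio.run(run_single_pe(2.0, +1, [(1.0, 3.0, 0.0), (2.0, 2.0, 10.0)]))
assert outputs == [(1.0, 8.0), (2.0, 18.0)]
```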


Protocol  realization  for  Algorithm  2 

A protocol for realization of Algorithm 2 must control the flow of data and make the data flow independent of the internal operations in each PE, such that the values of input variables are not overwritten during their computing time intervals. Because timing is crucial in spiral SAs for the synchronization of the PEs' input data, and because operations in asynchronous machines are triggered by the availability of the data, the system protocol for Algorithm 2 (1) triggers PEs' operations by the fullness of their $x_{in}$, $x'_{in}$, and $y_{in}$ input ports, and (2) allows data routing and computing to be performed simultaneously. In the proposed protocol for an LPF digital filter, five types of signals (REQ, ACK, FLG, EMP, and FUL) are introduced, two of which are external signals and three of which are internal signals. The REQ signal reports to the next PE that the data in the sender's output port are ready for transmission. The ACK signal reports to the previous PE that the sender's input port is ready to receive new data. The FLG, EMP, and FUL signals indicate the completion of the add/subtract-multiply-and-add computations, and the emptiness and fullness of the input port(s), respectively. The protocol can be described formally for data $x_{in}$ as shown below; the specifications of the protocols for $x'_{in}$ and $y_{in}$ are similar.

1. PE_{i,j} receives a request, REQ, from a source PE when the data in the source's output port are ready to be transmitted.

2. PE_{i,j} receives an acknowledge, ACK, from a destined PE when the destined PE's input port is ready to accept new data.

3. PE_{i,j} has three internal signals, FLG, EMP, and FUL, which indicate the completion of the add/subtract-multiply-and-add operations, and the emptiness and fullness of the input port(s).

4. PE_{i,j} contains the following transition latches:

Latch 1 activates the signal ACK, which is fired toward a source PE only after both signals REQ and EMP are received.

Latch 2 controls the data flow from the input port to the buffer-for-add/subtract and to the output port, and it also activates the signals EMP and FUL. It is fired only after the signal FLG is received.

Latch 3 controls the data flow from the output port of the PE to the input port of a destined PE. It is fired only after the signal ACK is received from the destined PE.
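
The latch conditions above can be sketched as a small Python state machine for one input port. This is a simplified illustration, not the report's circuit: the class and method names are hypothetical, and the exact moments at which EMP, FUL, and FLG toggle are assumptions chosen so that an input value is never overwritten while it is still in use.

```python
class PortHandshake:
    """Input-port side of the REQ/ACK protocol for one data stream (sketch)."""

    def __init__(self):
        self.data = None
        self.emp = True     # EMP: the input port is empty
        self.ful = False    # FUL: the input port holds unread data
        self.flg = True     # FLG: the previous computation has completed

    def latch1(self, req):
        """Latch 1: issue ACK only when a REQ is pending and EMP is set."""
        return bool(req) and self.emp

    def accept(self, value):
        """The source PE delivers data after it has received ACK."""
        if not self.emp:
            raise RuntimeError("input port would be overwritten")
        self.data = value
        self.emp, self.ful = False, True

    def latch2(self):
        """Latch 2: move data to the arithmetic buffer once FLG is set."""
        if not (self.ful and self.flg):
            return None
        value, self.data = self.data, None
        self.emp, self.ful = True, False   # port is free for the next sample
        self.flg = False                   # a new computation is in progress
        return value

    def computation_done(self):
        """The arithmetic unit raises FLG when add-multiply-add completes."""
        self.flg = True


port = PortHandshake()
assert port.latch1(req=True)        # empty port: ACK is sent
port.accept(5)
assert not port.latch1(req=True)    # full port: no ACK, so no overwrite
assert port.latch2() == 5           # data moves on, port is empty again
```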

A detailed configuration of this protocol for a type I cell (case 1 or 3 FIR linear phase digital filters) is depicted in Figure 10. In this configuration, there are two lines that intersect each PE. One is a bidirectional control line for the transmission of the ACK and REQ bit-signals, and the other is a unidirectional data line. The linear phase FIR digital filter operations in PE_{i,j} are shown in Algorithm 3 [16].

The PE protocols for some of the most important DSP algorithms are shown in Appendix D [17-19].
Algorithm 3: Linear Phase FIR Digital Filter Operation
(A Completed Asynchronous Version)

Begin
  While there are data entering PE_{i,j} do
  Begin
    For all PE_{i,j}s asynchronously do
    Begin
      Wait for REQ_input from a source PE ;
      Receive REQ_input ;
      While input port is empty do
      Begin
        Activate REQ_input & EMP ;
        Send ACK_input ;
        Receive data x_in, x'_in, y_in ;
        Store data into the input port ;
        Activate FUL signal ;
        & Send REQ_output to a destined PE ;
        Wait for ACK_output from the destined PE ;
        Send x_in to output port & buffer-for-add ;
        & Add (v_{i,j} ← x_in + x'_in) ;
        Store v_{i,j} into buffer-for-mult ;
        Read ω_i from buffer-for-mult ;
        & v_{i,j} from buffer-for-mult ;
        Mult (u_{i,j} ← ω_i v_{i,j}) ;
        Store u_{i,j} into buffer-for-add ;
        Read u_{i,j} from buffer-for-add ;
        & y_in from buffer-for-add ;
        Add (y_out ← u_{i,j} + y_in) ;
        Store y_out into buffer-for-W ;
        & Activate FLG signal ;
      End While
    End For
  End While
End
Fig. 10. Proposed protocol in (type I) PE_{i,j}.

2.3 Performance Aspect


The execution time modeling of an asynchronous system provides a quantitative evaluation of the developed algorithms. We developed three performance measures (throughput, speedup, and efficiency) for evaluating the asynchronous spiral SA realizations. The performance analysis technique is demonstrated next for the LPF digital filter asynchronous SA realization [16]. The following notation is used to describe the time models of the arrays:

$T_{In}$ = time for reading data from input port

$T_{D_i}$ = time for internal data transfer

$T_{D_e}$ = time for external data transfer

$T_{R}$ = signal REQ propagation time

$T_{Ak}$ = signal ACK propagation time

$T_{Fire}$ = time for firing data via latch

$T_{C}$ = clock time for synchronous systolic array realization

$T_{E}$ = signal EMP propagation time

$T_{A}$ = computing time for addition

$T_{M}$ = computing time for multiplication

$T_{W}$ = time for fetching data from buffer-for-add/multiply/etc.

$T_{F}$ = signal FLG/FUL propagation time

$T_{Out}$ = time for reading data from output port

The  time  delays,  starting  with  the  reading  of  data  from  input  port,  for  the 
respective  asynchronous  spiral  SA  and  the  corresponding  synchronous  systolic 
array  can  be  described  by  the  following  relationships  (see  Figure  10): 

-I-  Tq,  +  (Tp,  -I-  4-  +  Tp,yp)  f  Tq,,,  +  TqJ  . 

((Till  +  7"o,  +  -I-  -I-  I-  fiy  I-  +  r^y)  l-  (Tp  f  Tp,^^)  I-  (13) 

+  '^Fire)  ^Ou/  + 

^  =■-  <T/n  +  ^0,  +  '^OiJt  ^  f  (14) 

where, 

"^Cnm  ~  “^Ai,  ‘^'^Fire  ^ 

and, 

^  ^  i'^ln  ^0,  '*■  '■  3  ^  ^  ^ 

=  M'(T,„  +  +  3  (r„  +  r„))  =  MT, 

where,  M'  refers  to  the  number  of  PEs  in  the  synchronous  systolic  array,  and 

|/Vf  +  1  iffVodd,  even  symmetry 
\m  otherwise 


The throughput can be obtained by dividing the number of processors by the execution time per sample. Thus, the throughput using $M \times U$ processors, $R(M_U)$, is defined by the following equation:


R(M^)  = 


_ M _ 

+  ^O,  +  ^Out  +  +  Tm  +  T(;^om 


(16) 


The speedup is the ratio of the throughput in the asynchronous spiral SA system to that in a reference synchronous system. From above, the speedup via an asynchronous spiral SA realization is represented as


^  (T/n  +  ^0,  +  ^ 


i^ln  +  ^D,  +  ^Out  +  ^  ^  ^Cnm) 


(17) 


The  system  efficiency  is  defined  as  the  ratio  of  the  speedup  and  the  number  of 
processors 


EiM,)  = 


'Efn  +  T'o  +  T’o  +  3  +  Tjv) 


U{M')  (Ti„ 

^0,  ^Out  +  -I-  2Ta  f  Tf^  f  T^pn,) 


18) 


The above equation indicates that reducing the interprocessor communication time, $T_{Com}$, greatly improves the system's efficiency [16].
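
The three measures can be sketched in Python directly from their verbal definitions. The timing values below are hypothetical and are not taken from the report, and the synchronous and asynchronous arrays are assumed to use the same number of processors for simplicity. The sketch only illustrates the stated trend that a smaller communication time gives a higher efficiency.

```python
def throughput(num_processors, time_per_sample):
    """Number of processors divided by the execution time per sample."""
    return num_processors / time_per_sample


def speedup(async_throughput, sync_throughput):
    """Asynchronous throughput relative to the synchronous reference."""
    return async_throughput / sync_throughput


def efficiency(speedup_value, num_processors):
    """Speedup divided by the number of processors."""
    return speedup_value / num_processors


def async_efficiency(num_processors, t_sync, t_compute, t_com):
    """Efficiency when the asynchronous time per sample is t_compute + t_com."""
    r_async = throughput(num_processors, t_compute + t_com)
    r_sync = throughput(num_processors, t_sync)
    return efficiency(speedup(r_async, r_sync), num_processors)


# Hypothetical figures: 24 PEs, 100 time units per sample synchronously,
# 30 units of overlapped computation plus communication time asynchronously.
slow_links = async_efficiency(24, t_sync=100.0, t_compute=30.0, t_com=20.0)
fast_links = async_efficiency(24, t_sync=100.0, t_compute=30.0, t_com=10.0)
assert fast_links > slow_links
```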
APPENDIX A


THEORETICAL  ASPECT  FOR  DSP  ALGORITHMS 


—  Arbitrarily  Large  LMS  Adaptive  Filters 

A standard form for the LMS algorithm when real discrete-time signals, $x_j$, are used is described by

$$ W_{j+1} = W_j + 2\mu\,\varepsilon_j X_j \tag{A1} $$

where $W_j$ and $X_j$ are vectors representing, respectively, the filter weights and the outputs of the transversal filter line at time $j$,

$$ W_j^T = [\,w_{1,j},\; w_{2,j},\; \ldots,\; w_{N,j}\,] $$

$$ X_j^T = [\,x_j,\; x_{j-1},\; \ldots,\; x_{j-N+1}\,] $$

and $\mu$ is the gain constant that regulates the speed and stability of adaptation. The error, $\varepsilon_j$, is the difference between the desired response $d_j$, an externally supplied training signal, and the filter output $y_j$,

$$ \varepsilon_j = d_j - y_j $$

where the filter output is given by

$$ y_j = W_j^T X_j $$
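
The LMS recursion can be sketched in Python, assuming the widely used form $W_{j+1} = W_j + 2\mu\,\varepsilon_j X_j$. The function name, the noiseless system-identification setup, and the uniform random input are illustrative assumptions, not part of the report.

```python
import random


def lms_identify(target, mu, steps, seed=0):
    """Adapt weights W so that y_j = W^T X_j tracks d_j = target^T X_j."""
    rng = random.Random(seed)
    n = len(target)
    w = [0.0] * n
    x = [0.0] * n                                   # X_j = [x_j, ..., x_{j-N+1}]
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0)] + x[:-1]       # shift in a new sample
        d = sum(t * xi for t, xi in zip(target, x)) # desired response d_j
        y = sum(wi * xi for wi, xi in zip(w, x))    # filter output y_j
        e = d - y                                   # error eps_j = d_j - y_j
        w = [wi + 2.0 * mu * e * xi for wi, xi in zip(w, x)]
    return w


weights = lms_identify([0.5, -0.3], mu=0.05, steps=2000)
assert all(abs(a - b) < 1e-6 for a, b in zip(weights, [0.5, -0.3]))
```

The gain constant trades speed against stability: a larger `mu` converges in fewer steps but diverges once it exceeds a bound set by the input power.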

—  Convolution  and  Correlation 

Convolution and correlation are closely related operations that are basic to many areas of digital signal processing. The convolution of two time series corresponds to the product of the Fourier transforms of the two time series. They can be expressed for a finite length $N$ as

$$ \text{Convolution:}\quad c_{xy}(n) = \mathrm{conv}[x(k), y(k)] = \sum_{k=0}^{N-1} x(k)\,y(n-k) = \sum_{k=0}^{N-1} x(n-k)\,y(k) \tag{A2} $$

$$ \text{Correlation:}\quad r_{xy}(n) = \mathrm{corr}[x(k), y(k)] = \sum_{k=0}^{N-1} x(k)\,y(n+k) = \sum_{k=0}^{N-1} x(n+k)\,y(k) \tag{A3} $$
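
The first sums in (A2) and (A3) can be sketched in Python as follows. The function names, the equal-length inputs, and the convention that samples outside 0..N-1 are zero are assumptions made for illustration.

```python
def _at(seq, i):
    """Sample seq(i), taken as zero outside 0..N-1 (assumption)."""
    return seq[i] if 0 <= i < len(seq) else 0.0


def conv(x, y):
    """c_xy(n) = sum_k x(k) y(n-k), for n = 0..2N-2."""
    N = len(x)
    return [sum(x[k] * _at(y, n - k) for k in range(N))
            for n in range(2 * N - 1)]


def corr(x, y):
    """r_xy(n) = sum_k x(k) y(n+k), for lags n = -(N-1)..N-1."""
    N = len(x)
    return [sum(x[k] * _at(y, n + k) for k in range(N))
            for n in range(-(N - 1), N)]


a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, 0.5]
assert conv(a, b) == [0.0, 1.0, 2.5, 4.0, 1.5]
assert conv(a, b) == conv(b, a)            # convolution is commutative
assert corr(a, b)[len(a) - 1] == 3.5       # lag 0: sum of x(k) y(k)
```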

— DFT and IDFT

The  discrete  Fourier  transform  (DFT)  is  used  to  transform  an  ordered  sequence 
of  data  samples,  usually  from  the  time  domain  into  the  frequency  domain,  so  that 
spectral  information  about  the  sequence  can  become  known  explicitly.  The  inverse 
discrete  Fourier  Transform  (IDFT)  is  used  to  obtain  a  data  sequence  in  time  domain 




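The Hadamard matrix of the lowest order becomes

$$ H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{A7} $$

and the Hadamard matrix of order $N = 4$ is given as

$$ H_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} = \begin{bmatrix} H_2 & H_2 \\ H_2 & -H_2 \end{bmatrix} \tag{A8} $$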
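
The block structure of the order-4 matrix suggests a doubling construction, sketched here in Python. The function name is hypothetical, and the sketch assumes the order is a power of two.

```python
def hadamard(order):
    """Build the Hadamard matrix of a power-of-two order by block doubling."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # [[H, H], [H, -H]]: top rows repeat H, bottom rows negate the copy.
        h = ([row + row for row in h]
             + [row + [-v for v in row] for row in h])
    return h


assert hadamard(2) == [[1, 1], [1, -1]]
assert hadamard(4) == [[1, 1, 1, 1],
                       [1, -1, 1, -1],
                       [1, 1, -1, -1],
                       [1, -1, -1, 1]]
```

Because every entry is +1 or -1, the transform needs only additions and subtractions, which keeps each PE simple.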
APPENDIX B


SYSTEM ASPECT I(A)

— Arbitrarily Large LMS Adaptive Filters
— Computational graph for $y_j$


— Computational scheme for $y_j$
APPENDIX D


SYSTEM  ASPECT  II 

—  One-Step  Adaptive  Lattice  Predictors'  PE,„  Protocol 


—  Arbitrarily  Large  LMS  Adaptive  Filters  PE*  Protocol 